ABSTRACT


The reliability coefficient, $\rho_{xx}$, has long been accepted as an index of the stability, repeatability, and
precision of psychological tests. Because $\rho_{xx}$ measures the proportion of the variance in a set of scores
attributable to variation among individuals, values of $\rho_{xx}$ are sometimes compared to justify using particular
tests in studies of individual differences. Values of $\rho_{xx}$ are also sometimes compared to justify using
particular tests in experimental research. The latter practice is usually justified by arguing that larger values of
$\rho_{xx}$ imply greater measurement precision and, therefore, potentially greater sensitivity to experimental
treatments. That argument is not generally correct because the individual variation measured by $\rho_{xx}$ is
frequently confounded with measurement error in the denominators of significance tests. The effects of this
confounding lead to "paradoxical" situations in which reliability, as measured by $\rho_{xx}$, may be inversely
related (or unrelated) to experimental precision, as measured by the reciprocal of experimental error. Because the
power of an experiment increases with precision, as just defined, conditions that invert or negate the relationship
between $\rho_{xx}$ and precision also invert or negate the relationship between $\rho_{xx}$ and power. These
considerations do not mean that the reliability coefficient is necessarily irrelevant to experimental research.
Because experimental designs differ in the degree to which they are influenced by individual variation, a
consideration of the value of $\rho_{xx}$ a specific test yields will sometimes provide information about the best
design in which to use that test.

INTRODUCTION


The reliability coefficient, $\rho_{xx}$, plays a fundamental role in studies of individual differences because it
expresses the proportion of variance in a set of scores attributable to variation among individuals (Gulliksen, 1950).
Hence, $\rho_{xx}$ is often described as an index of the precision or accuracy of tests (Kerlinger, 1986; Lord &
Novick, 1968). For these reasons, values of $\rho_{xx}$ are sometimes compared to justify decisions to use particular
tests in studies of individual differences (Weiss & Davison, 1981).

Values of $\rho_{xx}$ are also sometimes compared to justify a decision to use a particular psychological test in
experimental studies. Common sense suggests that, if the reliability coefficient measures precision, a respectable
value of $\rho_{xx}$ should be necessary for a test to be sensitive to the effects of an experimentally manipulated
independent variable. Variants of this proposition have been accepted by numerous authors (e.g., Carter, Krause, &
Harbeson, 1986; Cleary & Linn, 1969; Cook & Campbell, 1979; Fleiss, 1976; Humphreys & Drasgow, 1989a,
1989b; NATO Aerospace Medical Panel Working Group 12, 1989; Sutcliffe, 1958). Unfortunately, a policy of
using reliability coefficients to judge the relative sensitivities of different psychological tests can yield misleading
results. Reasons why this is so are outlined in the remainder of this paper.

In the subsections that follow, I will outline the statistical issues as they pertain to simple between-groups
and repeated-measures designs. In the Discussion, I will review the controversy surrounding the interpretation of
the statistical results and examine a widely held informal argument according to which $\rho_{xx}$ is directly related
to power. We will see that in none of the experimental designs considered here should a comparison of the reliability
coefficients of different psychological tests be expected to indicate which of several different tests is likely to be
more sensitive to the effects of an experimental treatment. On the other hand, we will see that a knowledge of one
test's reliability coefficient can sometimes help an investigator determine the experimental design in which that test
will be most sensitive to the effects of an experimental treatment.

The relationship between $\rho_{xx}$ and experimental power was addressed some years ago by Overall and
Woodward (1975), who offered the "paradoxical" observation that, if measurement error is held constant, the power
of an analysis of difference scores is maximized when the reliability coefficient of the differences is zero. The
validity of their observation has been obscured by the contentious and occasionally confusing interchange that
followed. To understand the psychometric basis of the argument, recall that in classical test theory an observed test
score, $X_i$, is assumed to be the sum of a true score, $T_i$, and a measurement error, $E_i$ (e.g., Gulliksen, 1950).
Measurement errors are assumed to be random with a mean of zero, and to be independent of the true scores and of
each other. Hence the variance of a set of test scores is a sum of true-score and measurement-error variances; i.e.,


$$\sigma_x^2 = \sigma_T^2 + \sigma_e^2 \tag{1}$$


where $\sigma_x^2$ is the variance of the observed scores, $\sigma_T^2$ the variance of the true scores, and
$\sigma_e^2$ the variance of the measurement errors. The reliability coefficient, in turn, is the proportion of the
scores' variance attributable to variance in true scores (e.g., Gulliksen, 1950). That is:


$$\rho_{xx} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2} \tag{2}$$

Perhaps the most familiar estimate of $\rho_{xx}$ is the test-retest correlation, which is determined by obtaining
scores from the same individuals on two occasions and calculating the Pearson product-moment correlation between
first and second scores.
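
The link between Equation 2 and the test-retest correlation can be checked with a small simulation. The sketch below is illustrative only: the variances, the sample size, the normal distributions, and the function names are assumptions chosen for the example, not values from any study.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def reliability(var_true, var_error):
    """Equation 2: proportion of observed-score variance due to true scores."""
    return var_true / (var_true + var_error)

def simulate_test_retest(var_true, var_error, n_subjects, seed=0):
    """Simulate X = T + E on two occasions and return the test-retest correlation.

    True scores are fixed across occasions; errors are drawn independently
    each time (the classical test theory assumptions).
    """
    rng = random.Random(seed)
    sd_true, sd_error = sqrt(var_true), sqrt(var_error)
    true_scores = [rng.gauss(0.0, sd_true) for _ in range(n_subjects)]
    first = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    second = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    return pearson_r(first, second)

rho = reliability(4.0, 1.0)  # hypothetical variances
r_hat = simulate_test_retest(4.0, 1.0, n_subjects=20000)
print(f"rho_xx from Equation 2: {rho:.3f}; simulated test-retest r: {r_hat:.3f}")
```

With a large simulated sample the two printed values should be close, because the only source of covariance between occasions is the shared true score.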


BETWEEN-GROUPS  CONTRASTS 

First, consider a simple between-groups design in which a psychological test is administered to treatment
and control groups to measure the effect of an experimentally manipulated independent variable. Under the null
hypothesis that group means are equal, the independent-samples $t$ test for groups of equal sizes can be written:


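$$t = \frac{\bar{X}_1 - \bar{X}_2}{\left[(\hat{\sigma}_{x_1}^2 + \hat{\sigma}_{x_2}^2) / n\right]^{1/2}} \tag{3}$$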


in which $\bar{X}_1$ and $\bar{X}_2$ are the group means, $\hat{\sigma}_{x_1}^2$ and $\hat{\sigma}_{x_2}^2$ are the
estimated within-group variances, and $n$ is group size. Suppose, for simplicity, that true-score and
measurement-error variances do not differ between groups. If we replace the variance estimates in Equation 3 with
the right side of Equation 1, the equation for $t$ becomes:


$$t = \frac{\bar{X}_1 - \bar{X}_2}{\left[2(\hat{\sigma}_T^2 + \hat{\sigma}_e^2) / n\right]^{1/2}} \tag{4}$$


Equation 4 indicates that, if the difference $\bar{X}_1 - \bar{X}_2$ is constant, the value of $t$ (and, therefore,
the sensitivity of the test) varies inversely with the summed magnitudes of the true and measurement-error variances.
Furthermore, Equation 4 indicates that, in designs for which Equation 3 is appropriate, the value of Student's $t$
(and the sensitivity of the test) will be unrelated to the relative magnitudes of true and measurement-error
variances.¹ If the value of $t$ is unrelated to the relative magnitudes of true and measurement-error variances in
this design, $t$ must also be unrelated to the value of the reliability coefficient in this design. This is because,
as Equation 2 indicates, the reliability coefficient is determined by the relative magnitudes of true and error
variances (Williams & Zimmerman, 1989; Zimmerman & Williams, 1986). An analogous derivation for the analysis of
variance has been presented by Nicewander and Price (1978).²

To summarize: In between-groups experimental designs for which Equation 3 is appropriate, tests with
equal total variances yield equal power; tests with different total variances yield different levels of power. Because
power varies with the sum of $\sigma_T^2$ and $\sigma_e^2$, not their relative magnitudes, power in between-groups
designs is unrelated to $\rho_{xx}$.³


¹One might argue that it would be more appropriate to phrase Equations 3-7 in effect-size notation rather
than t-test notation. I have used t-test notation because to use effect-size notation would reduce clarity without
affecting the conclusions.

²It is worth noting that Sutcliffe (1980) also presented an equivalent derivation for the analysis of variance,
in an article occasionally cited as supporting the idea that experimental power necessarily increases with
reliability.
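
The between-groups summary can be illustrated numerically from Equations 2 and 4. In the sketch below, the three tests, their variances, the mean difference, and the group size are all hypothetical values chosen for the example.

```python
from math import sqrt

def reliability(var_true, var_error):
    """Equation 2: proportion of observed-score variance due to true scores."""
    return var_true / (var_true + var_error)

def between_groups_t(mean_diff, var_true, var_error, n):
    """Equation 4: independent-samples t for two groups of size n."""
    return mean_diff / sqrt(2.0 * (var_true + var_error) / n)

# Hypothetical tests: (true-score variance, measurement-error variance).
tests = {
    "P": (9.0, 1.0),  # high reliability, total variance 10
    "Q": (5.0, 5.0),  # moderate reliability, total variance 10
    "R": (1.0, 4.0),  # low reliability, total variance 5
}

# Same hypothetical mean difference (2.0) and group size (20) for every test.
results = {
    name: (reliability(vt, ve), between_groups_t(2.0, vt, ve, n=20))
    for name, (vt, ve) in tests.items()
}

for name, (rho, t) in results.items():
    print(f"test {name}: rho_xx = {rho:.2f}, t = {t:.3f}")
```

Tests P and Q differ in reliability but share a total variance, so they produce the same $t$; test R is the least reliable of the three yet produces the largest $t$ because its total variance is smallest.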


REPEATED-MEASURES  CONTRASTS 

Repeated-measures contrasts without subject-by-treatment interactions

Repeated-measures designs present somewhat different issues. Overall and Woodward (1975) considered
the case of difference scores calculated by subtracting subjects' scores in one experimental condition from their
scores in another. Difference scores of this type form the basic data of $t$ tests for correlated observations. Overall
and Woodward (1975) considered a model in which subjects do not differ in their responses to the experimental
treatment (i.e., a model in which no subject-by-treatment interaction occurs). When the null hypothesis is that the
average difference between scores in two experimental conditions is zero, the equation for the correlated $t$ test can
be written:


$$t = \frac{\bar{d}}{\hat{\sigma}_{\bar{d}}} \tag{5}$$


where $\bar{d}$ is the mean difference score and $\hat{\sigma}_{\bar{d}}$ is its estimated standard error. If
individual differences are assumed equal in the two experimental conditions (the usual assumption), variance
attributable to individuals disappears from the variance of the difference scores (Overall & Woodward, 1975). Hence,
the variance of the differences is simply the summed measurement-error variance of the original scores; i.e.,
$\sigma_d^2 = 2\sigma_e^2$, where $\sigma_e^2$ is the measurement-error variance of the original scores. Thus, the
$t$ for correlated observations can be written:


$$t = \frac{\bar{d}}{\left(2\hat{\sigma}_e^2 / n\right)^{1/2}} \tag{6}$$


An examination of Equation 6 indicates that if $\bar{d}$ and $n$ are treated as constants, the magnitude of $t$ and,
therefore, the power of a test of difference scores depends only on the magnitude of $\hat{\sigma}_e^2$ (Overall &
Woodward, 1975, 1976).

Therefore, if one's goal is to select the psychological test with the greatest power, and one has no reason to
suppose that one of the tests under consideration will yield a larger average difference, one should select the test
with the smallest value of $\hat{\sigma}_e^2$. The relevance of $\rho_{xx}$ to power in this example depends on the
relationship between $\sigma_e^2$ and $\rho_{xx}$. It follows from Equation 2 that this relationship is
$\sigma_e^2 = (1 - \rho_{xx})\sigma_x^2$. Hence, the reliability coefficients of the original test scores are, indeed,
relevant to power in tests based on difference scores (a point made by Overall and Woodward, 1975; 1976). The
relevance of $\rho_{xx}$ to power in this case is that the additional power afforded by using a specific test in a
repeated-measures design (rather than in a between-groups design) increases directly with the value of $\rho_{xx}$.
This does not mean that comparing the reliability coefficients of two tests will indicate which test will yield the
more powerful contrasts: Because $\sigma_T^2$ and $\sigma_e^2$ can both be expected to vary from one test to the next,
there will ordinarily be no reason to suppose that the test with the largest value of
$\sigma_T^2 / (\sigma_T^2 + \sigma_e^2)$ will have the smallest value of $\sigma_e^2$ (an exception to this
generalization is outlined in the Discussion). Thus, simple comparisons of the reliability coefficients of two tests
should not be expected to indicate which test is more powerful.


³Many authors have pointed out that it is possible to specify conditions under which differences in the
relative sizes of $\sigma_T^2$ and $\sigma_e^2$ lead to systematic changes in both reliability and power. The most
important example occurs when either $\sigma_T^2$ or $\sigma_e^2$ remains constant from one test to the next and the
other varies. Comparing Equations 2 and 4, one can see that if $\sigma_T^2$ remains constant while $\sigma_e^2$
varies, $\rho_{xx}$ will vary directly with $t$ (assuming, of course, the numerator of Equation 4 remains constant).
The opposite result is obtained if $\sigma_e^2$ remains constant while $\sigma_T^2$ varies. In this case, $t$ varies
inversely with $\rho_{xx}$. Whether it is ever plausible to assume that either $\sigma_T^2$ or $\sigma_e^2$ will
remain constant from one test to the next is an open question.
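
The statement that the advantage of a repeated-measures design grows with $\rho_{xx}$ can be made explicit by dividing Equation 6 by Equation 4. The derivation below is a sketch under assumptions added for illustration: the mean difference $\bar{d}$ equals the between-groups difference $\bar{X}_1 - \bar{X}_2$, $n$ is the same in both equations, no subject-by-treatment interaction occurs, and the variance estimates are replaced by their population values. It also ignores the difference in degrees of freedom between the two tests. The labels $t_{\text{within}}$ and $t_{\text{between}}$ are introduced here only for convenience.

```latex
\begin{align*}
\frac{t_{\text{within}}}{t_{\text{between}}}
  &= \frac{\left[2(\sigma_T^2 + \sigma_e^2)/n\right]^{1/2}}{\left(2\sigma_e^2/n\right)^{1/2}}
   = \left(\frac{\sigma_x^2}{\sigma_e^2}\right)^{1/2} \\
  &= \left(\frac{\sigma_x^2}{(1 - \rho_{xx})\,\sigma_x^2}\right)^{1/2}
   = \frac{1}{(1 - \rho_{xx})^{1/2}}
\end{align*}
```

Under these assumptions the ratio equals 1 when $\rho_{xx} = 0$ and grows without bound as $\rho_{xx}$ approaches 1. The ratio describes the choice of design for one test; it says nothing about which of two tests has the smaller $\sigma_e^2$.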

Overall and Woodward were primarily concerned with showing that difference scores, although frequently
possessing low reliability coefficients, do not necessarily yield contrasts of low power. They drew the seemingly
paradoxical conclusion that (when measurement error is held constant) "the value of the test statistic is maximized
when the reliability of the difference scores is zero" (Overall and Woodward, 1975, p. 86). To understand this
conclusion, note that the reliability and total variance of a set of difference scores will both increase if variance
attributable to individual differences is added to the error variance of the difference scores. However, the resulting
increase in total variance would inflate the denominator of Equation 6, thereby reducing $t$ and experimental power.
Of course, the conclusion that reliability varies inversely with power when $\sigma_e^2$ is held constant does not
mean that tests that yield unreliable difference scores will necessarily yield more powerful contrasts than tests that
yield reliable difference scores (Overall and Woodward never implied that this would be true). This is because
$\sigma_e^2$ can be expected to vary from one psychological test to the next.


Repeated-measures with subject-by-treatment interactions

Fleiss (1976) argued that the analysis of Overall and Woodward (1975) was based on the unrealistic
assumption that individuals do not vary in their responses to independent variables. Fleiss considered an alternative
repeated-measures model in which subjects may differ in their responses to the independent variable. When
subjects differ in the way they respond to an independent variable, the variances of difference scores become
$2\sigma_e^2 + 4\sigma_{st}^2$ rather than $2\sigma_e^2$ as in Equation 6 (Fleiss, 1976; Sutcliffe, 1980). The new
term, $4\sigma_{st}^2$, represents variance attributable to a subject-by-treatment interaction.⁴ Hence, the equation
for the correlated $t$ test in the presence of such an interaction might be rewritten:


$$t = \frac{\bar{d}}{\left[(2\hat{\sigma}_e^2 + 4\hat{\sigma}_{st}^2) / n\right]^{1/2}} \tag{7}$$


Fleiss argued that when the subject-by-treatment interaction variance is held constant, power is maximized when
the reliability coefficient of the difference scores, $\rho_{dd}$, is maximized, not when $\rho_{dd} = 0$. The
reliability coefficient of the difference scores can be understood to be the proportion of the total variance of the
differences attributable to a subject-by-treatment interaction (Fleiss, 1976; Sutcliffe, 1980). Examining Equation 7,
one can see that reducing measurement error while holding the interaction variance constant will cause $t$ to
increase. The reliability of the differences will also increase because the reduction in measurement error will reduce
the total variance of the differences, thereby increasing the proportion of the variance attributable to the
interaction. Hence, power will increase directly with the reliability of the difference scores when the
subject-by-treatment interaction is held constant and measurement error is allowed to vary.


⁴Fleiss (1976) and Sutcliffe (1980) defined $\sigma_{st}^2$ as the within-cell interaction variance. Nicewander and
Price (1983) have pointed out that, in discussions of ANOVA, $\sigma_{st}^2$ is conventionally used to refer to sums of
within-cell interaction variances (e.g., Scheffé, 1959), in which case the variance of the differences becomes
$2(\sigma_e^2 + \sigma_{st}^2)$. I will follow Fleiss and Sutcliffe's usage here, for consistency with the argument at
hand.

Overall and Woodward (1976) acknowledged Fleiss's point but noted that Fleiss had failed to address
theirs. Their point had been that the reduction in reliability that occurs when constant individual differences are
removed by calculating difference scores does not imply that contrasts based on difference scores must have low
power. It was logical for Overall and Woodward to approach this issue by considering the effects of reducing
$\sigma_T^2$ when $\sigma_e^2$ is held constant because the process of calculating difference scores eliminates
constant individual differences that contribute to $\sigma_T^2$ without affecting the random errors that produce
$\sigma_e^2$.
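
The two positions can be placed side by side numerically. The sketch below uses the difference-score variance of Equation 7 and, as an interpretation adopted only for this illustration, treats the individual-difference component of the difference scores in Overall and Woodward's argument as the interaction term. All variances, the mean difference, the number of subjects, and the function name are hypothetical.

```python
from math import sqrt

def difference_score_summary(var_error, var_interaction, mean_diff, n):
    """Return (reliability of the differences, correlated t) under Equation 7."""
    true_var = 4.0 * var_interaction        # interaction part of the differences
    total_var = 2.0 * var_error + true_var  # total variance of the differences
    rho_dd = true_var / total_var
    t = mean_diff / sqrt(total_var / n)
    return rho_dd, t

MEAN_DIFF, N = 1.0, 25  # hypothetical mean difference and number of subjects

# Overall and Woodward's case: measurement error fixed, systematic
# individual variation in the differences increases.
error_fixed = [
    difference_score_summary(1.0, v, MEAN_DIFF, N) for v in (0.0, 0.5, 1.0, 2.0)
]

# Fleiss's case: interaction variance fixed, measurement error decreases.
interaction_fixed = [
    difference_score_summary(e, 0.5, MEAN_DIFF, N) for e in (4.0, 2.0, 1.0, 0.5)
]

for label, rows in (("error fixed", error_fixed),
                    ("interaction fixed", interaction_fixed)):
    for rho_dd, t in rows:
        print(f"{label}: rho_dd = {rho_dd:.3f}, t = {t:.3f}")
```

In the first sweep the reliability of the differences rises while $t$ falls, and $t$ is largest where that reliability is zero; in the second sweep both rise together. The two conclusions differ only in which variance component is held constant.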

Thus, in the case of repeated-measures with subject-by-treatment interactions, if one's goal is to select the
test with the greatest power, and there is no reason to suppose that one of the tests under consideration will yield a
larger mean difference between conditions, the best strategy is to select the test with the smallest value of
$2\hat{\sigma}_e^2 + 4\hat{\sigma}_{st}^2$. However, the process of estimating the subject-by-treatment interaction
would require one to obtain data in the experimental conditions of interest. With that information in hand, one could
simply calculate values of $t$ (or, better yet, effect-size estimates) for each candidate test and use these values to
compare the tests' sensitivities directly.


DISCUSSION 

The  considerations  just  outlined  suggest  that  in  practical  situations  an  investigator  should  not  expect  that  a 
simple  comparison  of  the  reliability  coefficients  of  two  different  psychological  tests  will  reveal  whether  one  of  the 
tests  is  likely  to  yield  more  powerful  or  precise  measurements  of  the  effects  of  an  experimentally  manipulated 
independent  variable.  Except  in  special  cases,  the  reliability  coefficients  of  different  tests  need  not  be  directly 
related  to  the  magnitudes  of  error  terms  derived  from  scores  on  those  tests  (Nicewander  &  Price,  1978,  1983; 
Sutcliffe,  1980;  Williams  &  Zimmerman,  1989). 

An important special case in which power and precision are directly related to $\rho_{xx}$ occurs when an
investigator compares two tests that produce identical true scores but different measurement-error variances
(Nicewander & Price, 1978). If two tests produce the same true scores, the tests will also produce the same values
of $\sigma_T^2$. Hence, the test with the smaller value of $\sigma_e^2$ will be more reliable and more powerful
(consider Equations 2 and 4). A state of affairs like this can occur in practice when one succeeds in increasing
$\rho_{xx}$ by increasing test length. An increase in test length will sometimes reduce the influence of measurement
error, thereby increasing both $\rho_{xx}$ and power. Nicewander and Price (1983) suggest that the familiar practice
of increasing test length to increase reliability and power may be the source of the belief that greater reliability
is always associated with greater power; Nicewander and Price, however, also present a numerical counterexample in
which an increase in test length brought about by adding blocks of slightly nonparallel trials increases $\rho_{xx}$
but reduces power.
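
The favorable case of test lengthening can be sketched as follows. The example assumes an idealization not spelled out in the text: the test is scored as the mean of $k$ strictly parallel parts, so true-score variance is unchanged, error variance is divided by $k$, and the treatment shifts every part equally. Under that idealization the reliability follows the Spearman-Brown formula, which the code checks. All numeric values and function names are hypothetical.

```python
from math import sqrt

def lengthened_variances(var_true, var_error, k):
    """Variances of the mean of k strictly parallel parts (an idealization)."""
    return var_true, var_error / k

def reliability(var_true, var_error):
    """Equation 2."""
    return var_true / (var_true + var_error)

def between_groups_t(mean_diff, var_true, var_error, n):
    """Equation 4."""
    return mean_diff / sqrt(2.0 * (var_true + var_error) / n)

def spearman_brown(rho_single, k):
    """Reliability of a test lengthened k-fold with parallel parts."""
    return k * rho_single / (1.0 + (k - 1.0) * rho_single)

VAR_TRUE, VAR_ERROR = 4.0, 4.0  # hypothetical single-part variances
rho_single = reliability(VAR_TRUE, VAR_ERROR)

rows = []
for k in (1, 2, 4, 8):
    vt, ve = lengthened_variances(VAR_TRUE, VAR_ERROR, k)
    rows.append((k, reliability(vt, ve), between_groups_t(1.0, vt, ve, n=20)))

for k, rho, t in rows:
    print(f"k = {k}: rho_xx = {rho:.3f} "
          f"(Spearman-Brown {spearman_brown(rho_single, k):.3f}), t = {t:.3f}")
```

Here reliability and $t$ rise together because only $\sigma_e^2$ changes. The counterexample of Nicewander and Price (1983) involves nonparallel parts, which this sketch does not model.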

Despite the straightforward nature of these results, the relationship between $\rho_{xx}$ and experimental power
has remained controversial. Some investigators, without contesting the statistical results, have objected to the broad
conclusion (as it has sometimes been phrased) that reliability and power are unrelated in experimental studies (e.g.,
Humphreys & Drasgow, 1989a, 1989b; Sutcliffe, 1980). This objection is compelling because the reliability
coefficient of a test is, in fact, relevant to the issue of which experimental design will afford the most powerful
contrasts of scores from that specific test. For example, if an experimental treatment simply adds a constant to each
score, the power of a within-subjects design will exceed the power of a between-subjects design by an amount that
increases directly with $\rho_{xx}$ (recall the discussion of Equation 6).

A related objection derives from the philosophical idea that "reliability of measurement," properly defined,
should be directly related to the power of experiments. Nicewander and Price (1983) have noted that portions of
Sutcliffe's (1980) argument that reliability and power are directly related appear to imply that reliability might be
more appropriately defined as a function of the reciprocal of measurement error. Nicewander and Price (1983)
suggested that such a redefinition would solve the problem for contrasts based on difference scores. However,
consideration of Equations 4 and 7 indicates that defining "reliability" as a function of $1 / \sigma_e^2$ would not
necessarily make experimental power a direct function of "reliability" when $\sigma_T^2$ or $\sigma_{st}^2$ differ
from one test to the next. Humphreys and Drasgow (1989a, 1989b), in contrast, have suggested that it would be
possible to ensure that the reliabilities of difference scores always vary directly with power by incorporating the
magnitudes of treatment effects into the definition of the coefficient. However, as Overall (1989) noted in a response
to Humphreys and Drasgow, redefining the reliability coefficient as an index of effect size would require abandoning
the long-standing tradition of interpreting the reliability coefficient as a measure of sensitivity to individual
differences.

An informal argument sometimes offered to support the idea that larger values of $\rho_{xx}$ tend to be associated
with greater power is based on the notion that tests with relatively high values of $\rho_{xx}$ are relatively
sensitive to individual variation in true scores. According to this argument, if a test is relatively sensitive to
variation in true scores, it should also be relatively sensitive to experimentally induced changes in true scores.
This logic is correct if an additional assumption is valid. The additional assumption is that true scores on the tests
being compared bear equivalent functional relationships to the same set of underlying psychological variables. When
this assumption is valid, values of $\sigma_T^2$ obtained from all tests should be equal, and the average effect of
experimental treatments on true scores should also be equal. When these conditions hold, the test with the largest
reliability coefficient will be the test with the smallest $\sigma_e^2$. Assuming that all other factors are equal (as
they should be under these assumptions), the test with the smallest value of $\sigma_e^2$ will yield the most powerful
contrasts (see Equations 4, 6, and 7).

Some hazards of applying the logic just described to tests that are not essentially variants of a single test
follow from the possibilities that: (1) the underlying psychological processes that determine the true components of
test scores may differ from one test to the next and/or (2) the functions that relate true scores to underlying
processes may differ from one test to the next. To illustrate how such comparisons can go awry, suppose that tests
A and B have equal measurement error but that true scores on A are determined by psychological attributes Y1 and
Y2, whereas true scores on B are determined only by Y1. If so,


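$$\rho_{xx(A)} = \frac{\sigma_{Y_1}^2 + \sigma_{Y_2}^2}{\sigma_{Y_1}^2 + \sigma_{Y_2}^2 + \sigma_e^2} \tag{9}$$

whereas

$$\rho_{xx(B)} = \frac{\sigma_{Y_1}^2}{\sigma_{Y_1}^2 + \sigma_e^2} \tag{10}$$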


where $\rho_{xx(A)}$ is the reliability of test A, $\rho_{xx(B)}$ is the reliability of test B, $\sigma_{Y_1}^2$ is
true variance due to attribute 1, $\sigma_{Y_2}^2$ is true variance due to attribute 2, and $\sigma_e^2$ is
measurement error. Because measurement-error variances are equal, test A is more reliable than test B: its scores
contain more true variance relative to error. (The additional true variance in scores from A is the variance
attributable to Y2.) Although A is more reliable than B, the tests' relative sensitivities to experimental treatments
will be impossible to predict if one does not know the causal structures that generate their scores.

For example, test A will obviously be more sensitive than test B to treatments that affect only Y2 because
test B is unaffected by changes in Y2. In between-groups designs, however, the less "reliable" test B will be more
sensitive than test A to treatments that affect only Y1. This is because the error terms of between-groups
significance tests derived from scores on test B will equal $[2(\hat{\sigma}_{Y_1}^2 + \hat{\sigma}_e^2) / n]^{1/2}$,
whereas those derived from test A will equal
$[2(\hat{\sigma}_{Y_1}^2 + \hat{\sigma}_{Y_2}^2 + \hat{\sigma}_e^2) / n]^{1/2}$ (cf. Equation 4). Therefore, the error
terms of between-groups significance tests derived from scores from test A will be inflated by irrelevant true
variation in Y2. On the other hand, tests A and B will yield equally powerful contrasts of treatments that affect only
Y1 when they are used in within-subjects designs. This is because the two tests yield equal within-subjects error
terms, $(2\hat{\sigma}_e^2 / n)^{1/2}$ (see Equation 6).
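
The comparison of tests A and B can be worked through with numbers. The variances, the sample size, and the size of the treatment effect on Y1 in the sketch below are hypothetical; the sketch also assumes that a shift in Y1 carries over one-for-one to the true scores on both tests.

```python
from math import sqrt

# Hypothetical variance components and sample size.
VAR_Y1, VAR_Y2, VAR_E, N = 4.0, 4.0, 2.0, 20
SHIFT_Y1 = 1.0  # hypothetical treatment effect acting on Y1 only

# Reliabilities (Equations 9 and 10).
rho_a = (VAR_Y1 + VAR_Y2) / (VAR_Y1 + VAR_Y2 + VAR_E)
rho_b = VAR_Y1 / (VAR_Y1 + VAR_E)

def between_error(var_true_total, var_error, n):
    """Between-groups error term (cf. Equation 4)."""
    return sqrt(2.0 * (var_true_total + var_error) / n)

def within_error(var_error, n):
    """Within-subjects error term without interaction (Equation 6)."""
    return sqrt(2.0 * var_error / n)

between_a = between_error(VAR_Y1 + VAR_Y2, VAR_E, N)  # inflated by Y2 variance
between_b = between_error(VAR_Y1, VAR_E, N)
within_a = within_error(VAR_E, N)
within_b = within_error(VAR_E, N)

print(f"reliability: A = {rho_a:.3f}, B = {rho_b:.3f}")
print(f"between-groups t: A = {SHIFT_Y1 / between_a:.3f}, B = {SHIFT_Y1 / between_b:.3f}")
print(f"within-subjects t: A = {SHIFT_Y1 / within_a:.3f}, B = {SHIFT_Y1 / within_b:.3f}")
```

With these values test A is the more reliable test, test B gives the larger between-groups $t$, and the two tests give identical within-subjects $t$ values, matching the three outcomes described above.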

As these examples illustrate, when it is incorrect to assume that tests being compared bear equivalent
functional relationships to the same set of underlying psychological variables, $\rho_{xx}$ and power may be directly
related, inversely related, or unrelated. The direction and form of the relationship in any particular experiment will
depend on the design of the experiment and the nature of the causal structure linking the independent and dependent
variables.

Although comparing values of $\rho_{xx}$ to justify the use of particular tests in experimental studies is
hazardous, a test's reliability coefficient can still be relevant to the power of the experimental design in which the
test is used. For example, the power of a within-subjects contrast, relative to that of a between-subjects contrast,
varies directly with the correlation between subjects' scores in the different treatment conditions. As Overall and
Woodward (1975) have pointed out, this correlation equals $\rho_{xx}$ if the independent variable simply adds a
constant to each score (i.e., if subjects and treatments do not interact). Furthermore, knowledge of $\rho_{xx}$ can
sometimes be used to judge whether power can be increased more efficiently by adding a pretest and using its scores
as covariates or by increasing the length of the posttest (Maxwell, Cole, Arvey, & Salas, 1991). Moreover, knowledge
of the pre-post correlation (which equals $\rho_{xx}$ if subjects and treatments do not interact) can be used to judge
whether a between-groups contrast of pretest-posttest difference scores will be more or less powerful than a simple
contrast of posttest scores (Humphreys & Drasgow, 1989a; Kraemer & Thiemann, 1987; Overall & Ashby, 1991).
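
The last comparison, difference scores versus posttest scores alone, reduces to a simple threshold on the pre-post correlation. The sketch below assumes equal pretest and posttest variances, random assignment (so that the expected treatment effect is the same on difference scores and on posttest scores), and equal group sizes; the function names and numeric values are hypothetical. It relies on the standard result that the variance of a difference between two equally variable scores correlated $r$ is $2\sigma_x^2(1 - r)$.

```python
from math import sqrt

def posttest_error_term(var_x, n):
    """Between-groups error term for posttest scores (cf. Equation 4)."""
    return sqrt(2.0 * var_x / n)

def difference_error_term(var_x, pre_post_r, n):
    """Between-groups error term for posttest-minus-pretest differences.

    With equal pretest and posttest variances, each difference score has
    variance 2 * var_x * (1 - r).
    """
    var_diff = 2.0 * var_x * (1.0 - pre_post_r)
    return sqrt(2.0 * var_diff / n)

def difference_to_posttest_t_ratio(pre_post_r):
    """t for difference scores divided by t for posttest scores, same effect."""
    return 1.0 / sqrt(2.0 * (1.0 - pre_post_r))

for r in (0.2, 0.5, 0.8):
    ratio = difference_to_posttest_t_ratio(r)
    print(f"pre-post r = {r:.1f}: t ratio (difference / posttest) = {ratio:.3f}")
```

Under these assumptions difference scores yield the larger $t$ only when the pre-post correlation exceeds 0.5, which is why knowing that correlation helps in choosing between the two analyses.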


RECOMMENDATIONS 

1. The reliability coefficient of psychometric theory, $\rho_{xx}$, should not be used as a surrogate effect-size
estimate when selecting a test for use in a true experiment. This recommendation does not apply to (possibly rare)
comparisons among tests that differ only in measurement error ($\sigma_e^2$, as defined in psychometric theory), nor
does it apply to nonexperimental research in which individual variation is a focus of interest.

2. The reliability coefficient of a test can sometimes be used to select the most powerful experimental design for
use with that test. Hence, general statements to the effect that $\rho_{xx}$ is irrelevant to experimental research
are incorrect.